Introduction

Sickle cell disease (SCD) is a life-threatening autosomal recessive disorder caused by abnormal hemoglobin S, which makes red blood cells rigid and prone to lysis. These cells cause hemolytic anemia, vaso-occlusion, and progressive organ damage. Challenges for personalized medicine in SCD include limited high-quality evidence and the lack of large, standardized, high-quality datasets from which to derive it. Additionally, reliance on reference data from predominantly white populations reduces the accuracy of genetic and metabolic studies in diverse SCD populations.

SCD research in Europe is constrained by small, heterogeneous, and dispersed patient cohorts. AI offers potential to analyze large datasets, but requires standardized, high-quality data for robust model training.

This work presents a data validation and quality framework, aligned with FAIR principles (Findable, Accessible, Interoperable, Reusable) and EMA recommendations, for building a cross-border European SCD multi-modal real-world dataset integrating clinical, functional, and -OMICs data from five repositories in four countries.

Methods

The SCD use case protocol was defined by a multidisciplinary team, including clinicians, biostatisticians, and patients. It addressed hemolytic anemia, vaso-occlusion, and inflammation as disease hallmarks and defined three clinical outcomes: acute painful vaso-occlusive crisis (VOC), acute chest syndrome (ACS), and kidney disease (KDIGO criteria). Multi-modal datasets include clinical, rheological, genomic (GWAS), and metabolomic data.

Clinical data were collected and standardized using an electronic case report form (eCRF) comprising 275 parameters, resulting in 603 variables covering demographics, medical history, genetics, laboratory results, acute events, organ damage, radiology, and treatment.

Individual patient data were processed from 17 centers using 5 repositories across 4 countries: Italy (1 center, custom web platform), France (2 centers, web platform and Excel file), Spain (10 centers, REDCap), and the Netherlands (4 centers, Castor).

Data harmonization followed RADeep standards for rare anemia disorders, applying international vocabularies and aligning with ERDRI (European Rare Disease Registries Infrastructure) and ENROL recommendations for interoperability.

A specific ETL (Extract, Transform, Load) process was developed for each repository. Data quality was assessed across four dimensions: integrity, completeness, consistency, and accuracy.
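
As an illustration of what a repository-specific ETL step can look like, the following minimal sketch renames source fields to a common schema and recodes one categorical variable. The repository names, field names, and codes are hypothetical and do not reflect the actual export formats or mappings used in this work.

```python
# Hypothetical source layouts: repository names, field names, and codes are
# illustrative assumptions, not the actual export formats used in the study.
FIELD_MAPS = {
    "repository_a": {"record_id": "patient_id", "sex": "sex", "hb": "hemoglobin_g_dl"},
    "repository_b": {"participant": "patient_id", "gender": "sex", "hemoglobin": "hemoglobin_g_dl"},
}
SEX_CODES = {
    "repository_a": {"1": "male", "2": "female"},
    "repository_b": {"M": "male", "F": "female"},
}


def transform(source, raw_row):
    """Rename source fields to the common schema and recode sex."""
    row = {target: raw_row.get(field) for field, target in FIELD_MAPS[source].items()}
    # Unknown or missing codes become None so they surface as missing values.
    row["sex"] = SEX_CODES[source].get(row["sex"])
    row["source"] = source
    return row


def load(sources):
    """Transform every row of every source into one harmonized list."""
    return [transform(name, raw) for name, rows in sources.items() for raw in rows]


harmonized = load({
    "repository_a": [{"record_id": "A-1", "sex": "2", "hb": 8.4}],
    "repository_b": [{"participant": "B-7", "gender": "M", "hemoglobin": 9.1}],
})
```

Keeping one mapping per repository confines source-specific logic to small lookup tables, so every later quality check can run on a single harmonized schema.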

Integrity was evaluated by detecting unexpected records, duplicates, and mismatches. Completeness was assessed through missingness and non-response rates. Consistency was checked using data quality rules to identify inadmissible values and violations of logical restrictions. A data quality report was shared with centers to resolve inconsistencies.
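
Two of these checks can be sketched in a few lines: duplicate detection for integrity and rule-based range checks for consistency. The records, field names, and admissible ranges below are assumptions chosen for illustration only. They are not the rules applied in this work.

```python
from collections import Counter

# Hypothetical harmonized records; values are illustrative only.
records = [
    {"patient_id": "ES-001", "hemoglobin_g_dl": 8.4, "age_years": 27},
    {"patient_id": "ES-002", "hemoglobin_g_dl": None, "age_years": 31},
    {"patient_id": "ES-002", "hemoglobin_g_dl": 9.1, "age_years": 31},
    {"patient_id": "FR-001", "hemoglobin_g_dl": 84.0, "age_years": -3},
]

# Assumed admissible ranges, one predicate per variable.
RULES = {
    "hemoglobin_g_dl": lambda value: 1.0 <= value <= 25.0,
    "age_years": lambda value: 0 <= value <= 120,
}


def duplicate_ids(rows, key="patient_id"):
    """Integrity: identifiers that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(identifier for identifier, n in counts.items() if n > 1)


def rule_violations(rows, rules):
    """Consistency: (row index, variable, value) for each inadmissible value."""
    violations = []
    for index, row in enumerate(rows):
        for variable, is_valid in rules.items():
            value = row.get(variable)
            # Missing values are a completeness issue, not a rule violation.
            if value is not None and not is_valid(value):
                violations.append((index, variable, value))
    return violations


duplicates = duplicate_ids(records)
violations = rule_violations(records, RULES)
```

The resulting lists of duplicates and violations are the kind of content a per-center data quality report could carry back to the contributing sites.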

Data accuracy was evaluated by testing, with generalized linear mixed models (GLMMs), whether well-known disease markers were significantly associated with the risk of each clinical outcome.
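
The model specification is not detailed here. As a sketch under stated assumptions (a binary outcome, a logit link, and a random intercept per center, none of which is confirmed by the text), a GLMM of this kind can be written for patient $i$ in center $j$ with markers $x_{ijk}$ as:

```latex
\operatorname{logit} \Pr(y_{ij} = 1 \mid u_j)
  = \beta_0 + \sum_{k} \beta_k x_{ijk} + u_j,
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2)
```

Under this form, $\exp(\beta_k)$ is the odds ratio for a one-unit increase in marker $k$, and the reported p-values would correspond to tests of $\beta_k = 0$.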

Results

The European multi-modal SCD dataset includes standardized clinical data on 1,346 variables from 1,274 patients (ES=297, FR=483, IT=116, NL=378). For AI readiness, a quality framework was applied to 235 variables. Completeness ≥80% was achieved in 75.7% of them.
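
A figure such as the share of variables reaching ≥80% completeness can be computed as in the minimal sketch below. The toy rows and variable names are hypothetical, and treating only `None` as missing is an assumption. A real pipeline would also need to handle non-response codes.

```python
def completeness(rows, variable):
    """Fraction of rows with a non-missing value for the variable."""
    present = sum(1 for row in rows if row.get(variable) is not None)
    return present / len(rows)


def share_meeting_threshold(rows, variables, threshold=0.80):
    """Return the share of variables at or above the threshold, and their names."""
    passing = [v for v in variables if completeness(rows, v) >= threshold]
    return len(passing) / len(variables), passing


# Toy data: five patients, three variables with different missingness.
rows = [
    {"sex": "female", "ferritin": 410.0, "rdw": 18.2},
    {"sex": "male", "ferritin": 250.0, "rdw": None},
    {"sex": "female", "ferritin": 980.0, "rdw": None},
    {"sex": "male", "ferritin": None, "rdw": 21.0},
    {"sex": "female", "ferritin": 130.0, "rdw": None},
]

share, passing = share_meeting_threshold(rows, ["sex", "ferritin", "rdw"])
```

In this toy example, `sex` and `ferritin` pass and `rdw` does not, so two of the three variables meet the threshold.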

Accuracy of the final dataset was demonstrated by the significant associations of lower red blood cell count and higher serum ferritin with increased odds of experiencing ≥1 VOC in the past 24 months (p=0.035 and p=0.012, respectively). Similarly, higher serum ferritin and elevated red cell distribution width were associated with increased odds of ≥1 ACS (p<0.001 and p=0.015, respectively), while albuminuria was significantly associated with the presence of kidney disease (p<0.001).

Validated clinical data were linked via unique patient identifiers to rheological, genomic, and metabolomic datasets, processed through standard pipelines and transformed into tabular format.
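
Linkage by a unique patient identifier amounts to a keyed join across tabular datasets. The sketch below is a minimal illustration with hypothetical identifiers, modality names, and columns. It keeps every clinical record and reports identifiers that appear in a modality table without a clinical match.

```python
def link_modalities(clinical, modalities):
    """Left-join each modality table onto the clinical rows by patient_id."""
    linked = []
    for row in clinical:
        merged = dict(row)
        for name, table in modalities.items():
            # Prefix columns with the modality name to avoid collisions.
            for column, value in table.get(row["patient_id"], {}).items():
                merged[f"{name}.{column}"] = value
        linked.append(merged)
    return linked


def unmatched_ids(clinical, table):
    """Identifiers present in a modality table but absent from clinical data."""
    clinical_ids = {row["patient_id"] for row in clinical}
    return sorted(set(table) - clinical_ids)


# Hypothetical example data.
clinical = [
    {"patient_id": "IT-001", "voc_past_24m": 2},
    {"patient_id": "NL-014", "voc_past_24m": 0},
]
modalities = {
    "rheology": {"IT-001": {"marker_a": 0.41}},
    "metabolomics": {"NL-014": {"marker_b": 1.7}, "NL-999": {"marker_b": 2.3}},
}

linked = link_modalities(clinical, modalities)
orphans = unmatched_ids(clinical, modalities["metabolomics"])
```

Reporting unmatched identifiers alongside the join makes linkage failures visible instead of silently dropping records.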

Conclusions

This work delivers the first large European SCD clinical dataset that can be merged with other data types to build the largest multi-modal European SCD dataset, comprising clinical, laboratory, rheological, genetic, and metabolomic data for 1,274 SCD patients. Its quality is sufficient to explore the development of AI solutions that improve SCD clinical management through a personalized precision medicine approach based on disease prognosis and decision-making, and even to enlarge specific cohorts using models for synthetic multi-modal data generation.
